Abstract: Advances in high-resolution airborne image sensors, including 
those on board Unmanned Aerial Vehicles, demand automatic solutions for processing, 
either on-line or off-line, the huge amounts of image data sensed during the flights. The 
classification of natural spectral signatures in images is one potential application. The current 
trend in classification is towards the combination of simple classifiers. In this 
paper we propose a combined strategy based on the Deterministic Simulated Annealing 
(DSA) framework. The simple classifiers used are the well-tested supervised parametric 
Bayesian estimator and Fuzzy Clustering. DSA is an optimization approach that 
minimizes an energy function. Its main advantage is its ability to avoid local 
minima during the optimization process thanks to the annealing scheme. It outperforms 
the simple classifiers used for the combination and some combined strategies, including a 
scheme based on fuzzy cognitive maps and an optimization approach based on the 
Hopfield neural network paradigm. 

Keywords: deterministic simulated annealing; image-based airborne sensors; classifier 
combination; fuzzy classifier; Bayesian classifier; unsupervised; spectral signatures classification 
1. Introduction 

Advances in airborne sensors and their image-capturing capabilities, 
including those on board the new generations of Unmanned Aerial Vehicles, demand 
solutions for different image-based applications. Natural spectral signature classification is one such 
application because of the high spatial resolution of the images. The areas where the identification of spectral 
signatures is suitable include agricultural crop ordination, forest area determination, urban 
identification, damage evaluation in catastrophes, and dynamic path planning during rescue missions 
or intervention services, also in catastrophes (fires, floods, etc.), among others. This justifies the choice 
of images with different spectral signatures as the data to which the proposed approach is 
applied, providing an application for this kind of sensor. 

All classification problems need the selection of features to be classified and their associated 
attributes or properties, where a feature and its attributes describe a pattern. The behaviour of different 
features has been studied in texture classification [1-3]. There are two categories depending on the 
nature of the features used: pixel-based [4-6] and region-based [2,7-10]. A pixel-based classification 
tries to classify each pixel as belonging to one of the clusters. A region-based classification identifies patterns of 
textures within the image and describes each pattern by applying filtering (Laws masks, Gabor filters, 
wavelets, etc.); it is assumed that each texture displays different levels of energy, allowing its 
identification at different scales. The aerial images used in our experiments do not display texture 
patterns. This implies that textured regions cannot be identified. In this paper we therefore focus on the 
pixel-based category. Taking into account that we are classifying multi-spectral textured images, we use as 
attributes the three visible spectral Red-Green-Blue components, i.e., the RGB colour mapping. The 
RGB map performs better than other colour representations [11]; we have verified this assertion in our 
experiments, justifying its choice. 

An important issue reported in the literature is that the combination of classifiers performs better 
than simple classifiers [1,12-16]. Particularly, the studies in [17] and [18] report the advantages of 
using combined classifiers against simple ones. This is because each classifier produces errors on a 
different region of the input pattern space [19]. 

Nevertheless, the main problem is which strategy to choose for combining individual classifiers. 
This is still an open issue. Indeed, in [13] it is stated that the same method can work appropriately in 
one application and produce poor results in another. Hence, our goal is to find a combined strategy that 
works well for classifying spectral signatures in images. In [15] and [20] a review of 
different approaches is reported, including the way in which the classifiers are combined. Some 
important conclusions are: 1) if only labels are available, a majority vote should be suitable; 2) if 
continuous outputs like posterior probabilities are supplied, an average or some other linear 
combination is suggested; 3) if the classifier outputs are interpreted as fuzzy membership values, 
fuzzy approaches, such as aggregation operators, could be used; 4) it is also possible to train the output 
classifier separately using the outputs of the input classifiers as new patterns, where a hierarchical 
approach can be used [1]. 

We propose a new approach which combines two individual classifiers: the probabilistic parametric 
Bayesian (BP) approach [21] and the fuzzy clustering (FC) [21,22]. Two phases are 
involved in any classification process: training and decision. The combination of the 
outputs provided by the two individual classifiers is carried out during the decision phase, as we will 
explain later. Given a set of training data, scattered through the tri-dimensional RGB data space, and 
assuming that the number of clusters and the distribution of the samples into the clusters are known, both BP 
and FC individual classifiers estimate their associated parameters. Based on these estimated 
parameters, during the decision phase each individual classifier provides, for each pixel to be 
classified, a support of belonging to a cluster: BP provides probabilities and FC membership degrees, 
i.e., continuous outputs. 

Because the number of classes is known, we build a network of nodes $net_j$ for each class $w_j$, where 
each node i in the $net_j$ is identified as a pixel location i = (x, y) in the image which is to be classified. 
Each node i is initialized in the $net_j$ with the output probability, provided by BP, that the node belongs 
to the class $w_j$. This is the initial state value for the node i in the $net_j$. Each state is later iteratively 
updated through the Deterministic Simulated Annealing (DSA) optimization strategy, taking into 
account the previous states and two types of external influences exerted by other nodes in its 
neighbourhood. The external influences are mapped as consistencies under two terms: regularization 
and contextual. These terms are clique potentials of an underlying Markov Random Field model [23] 
and they both involve a kind of human perception. Indeed, the tri-dimensional scenes are captured by 
the imaging sensor and mapped into the bi-dimensional space. Although the third dimension is lost under 
this mapping, the spatial grouping of the regions is preserved, and they are visually perceived grouped 
together as in the real scene. 

The above allows the application of the Gestalt principles of psychology [24,25], specifically: 
similarity, proximity and connectedness. The similarity principle states that similar pixels tend to be 
grouped together. The proximity principle states that pixels near to one another tend to be grouped 
together. The connectedness states that the pixels belonging to the same region are spatially connected. 
The proximity and connectedness principles justify the choice of the neighbourhood for defining the 
regularization and contextual terms and the similarity establishes the analogies in the supports received 
by the pixels in the neighbourhood coming from the individual classifiers. From the point of view of 
the combination of classifiers the most relevant term is the regularization one. This is because it 
compares the supports provided by the individual classifier FC as membership degrees and the states 
of the nodes in the networks, which, as aforementioned, initially are the probabilities supplied by the 
individual classifier BP as supports. Therefore, this is the term where the combination of classifiers is 
really carried out making an important contribution of this paper. 

The choice of BP and FC as the simple classifiers for the combination is based on their well-tested 
performance in the literature and also on the possibility of combining continuous outputs during the 
decision phase under a mechanism different from the classical one used in [15]. Nevertheless, different 
classifiers providing continuous outputs, or others from which such outputs can be obtained, could be used. As 
mentioned before, we have focused the combination on the decision phase; this implies that other 
strategies that apply the combination during the training phase are out of the scope of this paper. One 
of them is proposed in [26], which has been used in various classification problems. In this model, a 
selector makes use of a separate classifier, which determines the participation of the experts in the final 
decision for an input pattern. This architecture has been proposed in the neural network context. The 
experts are neural networks, which are trained so that each network is responsible for a part of the 
feature space. The selector uses the output of another neural network called the gating network [15]. 
The input of the gating network is the pattern to be classified and the output is a set of values 
determining the competences for each expert. These competences are used during the decision 
together with the classifier outputs provided by the experts. Under the above considerations we justify the 
choice of BP and FC as the base classifiers for the proposed combined strategy. 

We have designed two similar combined strategies. The first one is based on the fuzzy cognitive maps 
(FCM) framework [27] and the second on the analog Hopfield neural network (HNN) paradigm [28], 
where in the latter an energy minimization approach is also carried out. The best performance 
achieved, considering both strategies, is a success rate of about 85%. After additional experiments with the 
HNN, we have verified that this is because the energy sometimes falls into local minima that are not 
global optima. This behaviour of HNN is reported in [29]. The DSA is also an energy optimization 
approach, with the advantage that it can avoid local minima. Indeed, according to [23] and reproduced 
in [29], when the temperature involved in the simulated annealing process satisfies some constraints 
(explained in section 2.2) the system converges to the minimum global energy, which is controlled 
by the annealing schedule instead of the nonlinear first-order differential equation used in HNN. This 
is the main difference of the proposed DSA technique with respect to the HNN approach. The FCM 
does not work with energy minimization, but because it does not improve the results of HNN, we think 
that it is unable to solve this problem. Hence, we exploit the capability of the DSA for avoiding local 
minima, which is the main contribution of this paper. The DSA outperforms the FCM and HNN 
combined strategies, as well as the classical ones and the simple classifiers. 

The paper is organized as follows. In Section 2 we give details about the proposed combined 
classifier, describing the training and decision phases, especially the latter, where the DSA mechanism 
is involved. In Section 3 we give details about the performance of the proposed strategy applied to 
natural images displaying different spectral signatures. Finally, the conclusions are presented in 
Section 4. 

2. Design of the Classifier 

The system works in two phases: training and decision. As mentioned before, we have available a 
set of scattered patterns for training, partitioned into a known number of classes, c. The training 
patterns are supplied to the BP and FC classifiers for computing their parameters. 
These parameters are later recovered during the decision phase for making decisions about the new 
incoming samples, which are to be classified. 

2.1. Training Phase 

During the training phase, we start with the observation of a set X of n training samples, i.e., 
$X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^d$, where d is the data dimensionality, which is set to 3 because the samples 
represent the R, G and B spectral components of each pixel. Each sample is to be assigned to a given 
class $w_j$, where the number of possible classes is c, i.e., $j = 1, 2, \ldots, c$. 

a) Fuzzy Clustering (FC) 

This process receives the input training patterns and computes for each $x_i \in X$ at the iteration t its 
membership grade $\mu_i^j$ and updates the class centres $v_j \in \mathbb{R}^d$ as follows [20,22]: 

$$\mu_i^j(t+1) = \frac{1}{\sum_{r=1}^{c}\left(d_{ij}(t)/d_{ir}(t)\right)^{2/(m-1)}}; \qquad v_j(t+1) = \frac{\sum_{i=1}^{n}\left[\mu_i^j(t)\right]^m x_i}{\sum_{i=1}^{n}\left[\mu_i^j(t)\right]^m} \qquad (1)$$

$d_{ij}^2 \equiv d^2(x_i, v_j)$ is the squared Euclidean distance between $x_i$ and $v_j$, and equivalently $d_{ir}^2$ between $x_i$ and 
$v_r$. The number m is called the exponent weight [22,30]. The iteration process stops 
when $\left\|\mu_i^j(t+1) - \mu_i^j(t)\right\| < \varepsilon \;\; \forall i,j$ or when a number $t_{max}$ of iterations is reached, set to 50 in our 
experiments; $\varepsilon$ has been fixed to 0.01 after experimentation. Once the fuzzy clustering process is 
carried out, each class $w_j$ has its associated centre $v_j$. 
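
The update in equation (1) can be illustrated with a minimal pure-Python sketch. The function names (`sq_dist`, `fc_step`, `fc_fit`), the small distance guard and the choice to recompute the centres from the freshly updated memberships (the usual alternating scheme) are assumptions made for illustration, not the authors' implementation.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fc_step(samples, centres, m=2.0):
    """One fuzzy-clustering iteration: memberships first, then class centres."""
    guard = 1e-12  # assumption: avoids a zero distance when a sample sits on a centre
    memberships = []
    for x in samples:
        d2 = [max(sq_dist(x, v), guard) for v in centres]
        # (d_ij / d_ir)^(2/(m-1)) is the same as (d_ij^2 / d_ir^2)^(1/(m-1))
        row = [1.0 / sum((d2j / d2r) ** (1.0 / (m - 1.0)) for d2r in d2)
               for d2j in d2]
        memberships.append(row)
    dim = len(samples[0])
    new_centres = []
    for j in range(len(centres)):
        w = [row[j] ** m for row in memberships]
        total = sum(w)
        new_centres.append(tuple(
            sum(wi * x[k] for wi, x in zip(w, samples)) / total
            for k in range(dim)))
    return memberships, new_centres


def fc_fit(samples, centres, m=2.0, tol=0.01, t_max=50):
    """Iterate until the largest membership change is below tol or t_max is hit."""
    prev = None
    for _ in range(t_max):
        mu, centres = fc_step(samples, centres, m)
        if prev is not None:
            change = max(abs(a - b)
                         for r0, r1 in zip(prev, mu) for a, b in zip(r0, r1))
            if change < tol:
                break
        prev = mu
    return mu, centres
```

Calling `fc_fit(samples, initial_centres)` with RGB triplets returns the membership rows (one per sample, summing to one) and the estimated centres; the defaults mirror the values m = 2.0, ε = 0.01 and a limit of 50 iterations quoted in the text.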
b) Bayesian Parametric (BP) estimation 

Assuming that the distribution for each class $w_j$ is known (Gaussian), the probability density function is 
expressed as follows: 

$$p(x \mid w_j) = \frac{1}{(2\pi)^{d/2}\,|C_j|^{1/2}} \exp\left[-\frac{1}{2}(x - m_j)^T C_j^{-1}(x - m_j)\right] \qquad (2)$$

where the parameters to be estimated are the mean $m_j$ and the covariance $C_j$, both for each class $w_j$ 
with $n_j$ samples. They are estimated through maximum likelihood as given by equation (3): 

$$m_j = \frac{1}{n_j}\sum_{k=1}^{n_j} x_k; \qquad C_j = \frac{1}{n_j - 1}\sum_{k=1}^{n_j}(x_k - m_j)(x_k - m_j)^T \qquad (3)$$

where T denotes transpose. The parameters $v_j$, $m_j$ and $C_j$ are stored to be recovered during the next 
decision phase. 
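
As a hedged illustration of the parameter estimation in equation (3), the following sketch computes the mean vector and covariance matrix of the samples of one class. The function name `estimate_gaussian` is hypothetical, and the 1/(n_j − 1) normalisation is an assumption; dividing by n_j instead gives the strict maximum-likelihood estimate.

```python
def estimate_gaussian(samples):
    """Return (mean, covariance) for the samples assigned to one class.

    samples: list of equal-length tuples, e.g. RGB triplets.
    Assumption: the covariance is normalised by n - 1.
    """
    n = len(samples)
    d = len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    cov = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in samples) / (n - 1)
            for b in range(d)]
           for a in range(d)]
    return mean, cov
```

In the setting of the paper this would be called once per class with the training pixels of that class, and the resulting pair stored for the decision phase.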

2.2. Decision Phase 

Given a new sample $x_i$, the problem is to decide which cluster it belongs to. We make the 
decision based on the final state values after the DSA optimization process. As mentioned before, the 
DSA is an energy optimization approach with the advantage that it can avoid local minima. Indeed, 
in accordance with [23] and reproduced in [29], when the temperature involved in the simulated 
annealing process satisfies some constraints, explained below, the system converges to the minimum 
global energy, which is controlled by the annealing. The minimization is iteratively achieved by 
modifying the state of each node through the external influences exerted by other nodes and its own state 
at the previous iteration. 

As mentioned in the introduction, for each cluster $w_j$ we build a network of nodes, $net_j$. Each 
node i in the $net_j$ is associated with the pixel location i = (x, y) in the image, which is to be classified; the 
node i in the $net_j$ is initialized with the probability $p_i^j = p(x_i \mid w_j)$ provided by BP according to 
equation (2), but mapped linearly to the range [−1,+1] instead of [0,+1]. The probabilities are the 
initial network states associated with the nodes. As is known, the simple BP method classifies each 
pixel i as belonging to the cluster $w_j$ according to the maximum network state value associated with the 
pixel i in the j networks, i.e., $i \in w_j$ if $p_i^j > p_i^h, \; \forall j \neq h$. Through the DSA these network states are 
reinforced or punished iteratively based on the influences exerted by their neighbours. The goal is to 
make better decisions based on more stable state values. 
Suppose a network with N nodes. The simulated annealing optimization problem is: modify the 
analogue values $p_i^j$ so as to minimize the energy [21,29]: 

$$E = -\frac{1}{2}\sum_{j=1}^{c}\sum_{i=1}^{N}\sum_{k=1}^{N} s_{ik}^j\, p_i^j\, p_k^j \qquad (4)$$

where $s_{ik}^j$ is the symmetric weight interconnecting two nodes i and k in the $net_j$ and can be positive or 
negative, ranging in [−1,+1]; $p_k^j$ is the state of the neighbouring node k in the $net_j$. Each $s_{ik}^j$ determines the 
influence that the node k exerts on i, trying to modify the state $p_i^j$. According to [21] the self-feedback 
weights must be null (i.e., $s_{ii}^j = 0$). The DSA approach tries to achieve the most stable network 
configuration based on the energy minimization. From equation (4) one can see that this expression 
requires the computation of $s_{ik}^j$ and the states of the nodes $p_i^j$ and $p_k^j$; $s_{ik}^j$ will be defined later in 
equation (7); both $p_i^j$ and $p_k^j$ are obtained after the corresponding updating process. 

The term $s_{ik}^j$ is a combination of two coefficients representing the mutual influence exerted by the k 
neighbours over i, namely: a) a regularization coefficient which computes the consistency between the 
states of the nodes and the membership degrees provided by FC in a given neighbourhood for each 
$net_j$; b) a contextual coefficient which computes the consistency between the class labels obtained after 
a previous classification phase. Both consistencies are based on the Gestalt principle of similarity 
[24,25], as explained in the introduction. The neighbourhood is defined as the m-connected spatial 
region, $N_i^m$, where m is set to 8 in this paper, and allows the implementation of the Gestalt principles of proximity and 
connectedness [24,25], also explained in the introduction. The regularization 
coefficient is computed at the iteration t according to equation (5): 

$$r_{ik}^j(t) = \begin{cases} 1 - \left|p_i^j(t) - \mu_k^j\right| & k \in N_i^m,\; i \neq k \\ 0 & k \notin N_i^m \text{ or } i = k \end{cases} \qquad (5)$$

where $\mu_k^j$ is the membership degree, supplied by FC, that a node (pixel) k with attributes $x_k$ belongs to the 
class $w_j$, computed through equation (1). These values are also mapped linearly to range in [−1,+1] 
instead of [0,+1]. From (5) we can see that $r_{ik}^j(t)$ ranges in [−1,+1], where the lower/higher limit means 
minimum/maximum influence respectively. 

The contextual coefficient at the iteration t is computed taking into account the class labels $l_i$ and $l_k$, as 
follows, where values of −1 and +1 mean negative and positive influence respectively: 

$$c_{ik}(t) = \begin{cases} +1 & l_i(t) = l_k(t),\; k \in N_i^m,\; i \neq k \\ -1 & l_i(t) \neq l_k(t),\; k \in N_i^m,\; i \neq k \\ 0 & k \notin N_i^m \text{ or } i = k \end{cases} \qquad (6)$$

Labels $l_i$ and $l_k$ are obtained as follows: given the node i, at each iteration t, we know its state at 
each $net_j$ as given by equation (8) below, initially through the supports provided by BP; we 
determine that the node i belongs to the cluster $w_j$ if $p_i^j > p_i^h, \; \forall j \neq h$, so we set $l_i$ to the j value which 
identifies the cluster, $j = 1, \ldots, c$. The label $l_k$ is set similarly. Thus, this coefficient is independent of the 
$net_j$, because it is the same for all networks. Both coefficients are combined as the averaged sum, taking 
into account the signs: 
$$W_{ik}^j(t) = \gamma\, r_{ik}^j(t) + (1-\gamma)\, c_{ik}(t); \qquad s_{ik}^j = \left[\mathrm{sgn}\left(W_{ik}^j\right)\right]^{v}; \qquad \mathrm{sgn}\left(W_{ik}^j\right) = \begin{cases} -1 & W_{ik}^j < 0 \\ +1 & W_{ik}^j > 0 \end{cases} \qquad (7)$$

$\gamma \in [0, 1]$ represents the trade-off between both coefficients. After a set of experiments we have chosen 
$\gamma = 0.80$ because $c_{ik}(t)$ considers the state values which are directly involved in the energy computation 
through equation (4). This avoids an excessive contribution of the state values to the energy value; sgn is 
the signum function and v is the number of negative values in the set $C = \{W_{ik}^j(t), r_{ik}^j(t), c_{ik}(t)\}$, i.e., 
given $S = \{q \in C \mid q < 0\} \subseteq C$, $v = \mathrm{card}(S)$. Note that $c_{ik}(t)$ is obtained after a previous decision phase. 
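
The regularization and contextual coefficients of equations (5) and (6), and their weighted sum with γ = 0.80, can be sketched as follows. All function names are hypothetical. The sketch assumes that states and membership degrees are already mapped to [−1, +1], and it stops at the linear combination: the sign-correction step of equation (7) is deliberately left out.

```python
def to_bipolar(value):
    """Map a support in [0, 1] linearly onto [-1, +1]."""
    return 2.0 * value - 1.0


def neighbours8(x, y, width, height):
    """8-connected neighbours of pixel (x, y) that fall inside the image."""
    return [(x + dx, y + dy)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)
            and 0 <= x + dx < width and 0 <= y + dy < height]


def regularization(p_i, mu_k):
    """Equation (5) for a neighbour k: 1 - |p_i - mu_k|, in [-1, +1]."""
    return 1.0 - abs(p_i - mu_k)


def contextual(label_i, label_k):
    """Equation (6) for a neighbour k: +1 for equal labels, -1 otherwise."""
    return 1.0 if label_i == label_k else -1.0


def combined_weight(p_i, mu_k, label_i, label_k, gamma=0.8):
    """Weighted sum of both coefficients (first part of equation (7))."""
    return (gamma * regularization(p_i, mu_k)
            + (1.0 - gamma) * contextual(label_i, label_k))
```

For pixels outside the neighbourhood, or for k = i, both coefficients are zero by definition, so the sketch simply never evaluates them: callers iterate over `neighbours8` only.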

The simulated annealing process was originally developed in [31,32] under a stochastic approach. In 
this paper we have implemented the deterministic one described in [21,33] because, as reported there, the 
stochastic approach is slow due to its discrete nature as compared to the analogue nature of the deterministic one. 
Following the notation in [21], let $u_i^j(t) = \sum_k s_{ik}^j(t)\, p_k^j(t)$ be the force exerted on node i by the other nodes 
$k \in N_i^m$ at the iteration t; then the new state $p_i^j(t+1)$ is obtained by adding the fraction $f(\cdot,\cdot)$ to the 
previous one as follows: 

$$p_i^j(t+1) = \frac{1}{2}\left[f\left(u_i^j(t), T(t)\right) + p_i^j(t)\right] = \frac{1}{2}\left[\tanh\left(u_i^j(t)/T(t)\right) + p_i^j(t)\right] \qquad (8)$$

where, as always, t represents the iteration index. The fraction $f(\cdot,\cdot)$ depends upon $u_i^j(t)$ and the 
temperature T at the iteration t. 

Equation (8) differs from the updating process in [21] because we have added the term $p_i^j(t)$ to 
the fraction $f(\cdot,\cdot)$. This modification represents the contribution of the self-support from node i to its 
updating process. This implies that the updated value for each node i is obtained by taking into account 
its own previous state value and also the previous state values and membership degrees of its neighbours. 
The introduction of the self-support tries to minimize the impact of an excessive neighbouring influence. 
Hence, the updating process tries to achieve a trade-off between its own influence and the influence 
exerted by the neighbouring nodes k by averaging both values. 
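
A single application of the update in equation (8) can be sketched in a few lines. The function name `update_state` and its argument layout (parallel lists of neighbour states and weights) are assumptions made for illustration.

```python
import math


def update_state(p_i, neighbour_states, weights, temperature):
    """Equation (8): average the annealed neighbour force with the node's own state.

    neighbour_states and weights are parallel lists over the neighbours k of i.
    """
    # Force exerted on node i by its neighbours.
    u = sum(s * p for s, p in zip(weights, neighbour_states))
    return 0.5 * (math.tanh(u / temperature) + p_i)
```

Because tanh is bounded by ±1 and the previous state lies in [−1, +1], the averaged result also stays in [−1, +1], so no extra clipping is needed.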

One can see from equation (7) that if a node i is surrounded by nodes with similar state values and 
labels, $s_{ik}^j(t)$ should be high. This implies that the $p_i^j(t)$ value should be reinforced through equation (8) 
and the energy given by equation (4) is minimum, and vice versa. Moreover, at high T, the value of 
$f(\cdot,\cdot)$ is lower for a given value of the forces $u_i^j(t)$. Details about the behaviour of T are given in [21]. We 
have verified that the fraction $u_i^j(t)/T(t)$ must be small as compared to $p_i^j(t)$ in order to prevent the 
updating from being controlled only by $u_i^j(t)$. Under the above considerations and based on [23,30,33], the 
following annealing schedule suffices to obtain a global minimum: $T(t) = T_0/\log(t+1)$, with $T_0$ being a 
sufficiently high initial temperature. $T_0$ is computed as follows [34]: 1) we select four images to be 
classified, computing the energy in (4) for each image after the initialization of the networks; 2) we 
choose an initial temperature that permits about 80% of all transitions to be accepted (i.e., transitions that 
decrease the energy function), and the temperature value is changed until this percentage is achieved; 3) we 
compute the M transitions $\Delta E_k$ and we look for a value of T for which $\frac{1}{M}\sum_{k=1}^{M}\exp(-\Delta E_k/T) = 0.8$; 
after rejecting the higher order terms of the Taylor expansion of the exponential, $T = 8\langle\Delta E_k\rangle$, where $\langle\cdot\rangle$ 
is the mean value. In our experiments, we have obtained $\langle\Delta E_k\rangle = 1.22$, giving $T_0 = 9.76$ (with a similar 
order of magnitude as that reported in [33]). We have also verified that a value of $t_{max} = 200$ suffices, 
although the expected condition $T(t) \to 0$ as $t \to +\infty$ in the original algorithm is not fully fulfilled. The 
assertion that it suffices is based on the fact that this limit was never reached in our experiments, as shown 
later in section 3; hence this value does not affect the results. The DSA process is synthesized as 
follows [21]: 

1. Initialization: load each node with $p_i^j(t = 0)$ according to equation (2); set $\varepsilon = 0.01$ (constant to 
accelerate the convergence, section 3.1); $t_{max} = 100$. Define nc as the number of nodes that change 
their state values at each iteration. 

2. DSA process: 
   t = 0; nc = 1 (so that the loop is entered) 
   while t < t_max and nc ≠ 0 
      t = t + 1; nc = 0 
      for each node i 
         update $p_i^j(t)$ according to equation (8), using equations (5) to (7) 
         if $|p_i^j(t) - p_i^j(t-1)| > \varepsilon$ then nc = nc + 1 
      end for 
   end while 

3. Outputs: the states $p_i^j(t)$ for all nodes updated. 

The decision about the classification of a node i with attributes $x_i$ as belonging to the class $w_j$ is 
made as follows: $i \in w_j$ if $p_i^j > p_i^h, \; \forall w_j \neq w_h$. 
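
The iterative process and the final decision can be sketched end to end on a small grid. This is a simplified illustration, not the authors' implementation: the names `temperature`, `classify` and `dsa` are hypothetical, the logarithm in the schedule is assumed to be natural, all nodes are updated synchronously, and the weighted sum of the two coefficients is used directly as the interconnection weight, omitting the sign-correction step of equation (7). Its behaviour may therefore differ from the results reported in the paper.

```python
import math


def temperature(t, t0=9.76):
    """Annealing schedule T(t) = T0 / log(t + 1); natural log is an assumption."""
    return t0 / math.log(t + 1)


def classify(states):
    """Label each pixel with the class whose network holds the highest state."""
    c, h, w = len(states), len(states[0]), len(states[0][0])
    return [[max(range(c), key=lambda j: states[j][y][x]) for x in range(w)]
            for y in range(h)]


def dsa(states, mu, t0=9.76, gamma=0.8, eps=0.01, t_max=100):
    """states[j][y][x] and mu[j][y][x] hold values already mapped to [-1, +1]."""
    c, h, w = len(states), len(states[0]), len(states[0][0])
    t, nc = 0, 1
    while t < t_max and nc != 0:
        t += 1
        nc = 0
        temp = temperature(t, t0)
        labels = classify(states)  # labels from the previous decision
        new = [[row[:] for row in net] for net in states]
        for j in range(c):
            for y in range(h):
                for x in range(w):
                    u = 0.0
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ky, kx = y + dy, x + dx
                            if (dy or dx) and 0 <= ky < h and 0 <= kx < w:
                                r = 1.0 - abs(states[j][y][x] - mu[j][ky][kx])
                                ctx = 1.0 if labels[y][x] == labels[ky][kx] else -1.0
                                s = gamma * r + (1.0 - gamma) * ctx
                                u += s * states[j][ky][kx]
                    new[j][y][x] = 0.5 * (math.tanh(u / temp) + states[j][y][x])
                    if abs(new[j][y][x] - states[j][y][x]) > eps:
                        nc += 1
        states = new
    return states, t
```

A typical call would be `final, iterations = dsa(bp_states, fc_memberships)` followed by `classify(final)`, where the two inputs are per-class grids built from the BP and FC supports.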

3. Comparative Analysis and Performance Evaluation 

To assess the validity and performance of the proposed approach we describe the tests carried out 
according to both processes: training and classification. First, we give details about the setting of some 
free parameters involved in the proposed method. 

3.1. Setting Free Parameters 

We have used several data sets for setting the free parameters; these are: 1) nine data sets from the 
Machine Learning Repository [35] (bupa, cloud, glass, imageSegm, iris, magi4, thyroid, pimaIndians 
and wine); 2) three synthetic data sets manually generated with different numbers of classes; 
and 3) four data sets coming from outdoor natural images, also with different numbers of classes. The 
use of these data, some of them different from the images with different spectral signatures, is justified 
by the idea that the values of the parameters to be set must have as much general validity as 
possible. 

a) Parameters involved in the FC training phase 

They are the exponential weight m in equation (1) and the convergence parameters $\varepsilon$ and $t_{max}$ used 
for its convergence. The number of classes and the distribution of the patterns on the clusters are 
assumed to be known. We apply the following cross-validation procedure [21]. We randomly split 
each data set into two parts. The first (90% of the patterns) is used as the training set. The other set 
(validation set) is used to estimate the global classification error based on the single FC classifier. We 
set m = 2.0 (which is a usual value), vary $\varepsilon$ from 0.01 to 0.1 in steps of 0.015 and estimate the 
cluster centres and membership degrees for each training set. Then, we compute the error rate for each 
validation set. The maximum error was obtained with $\varepsilon$ = 0.1 for 10 iterations and the minimum with 
$\varepsilon$ = 0.01 and 47 iterations. With those values fixed, we vary m from 1.1 to 4.0 in steps of 0.1 and estimate 
once again the cluster centres and the membership degrees with the training set. Once again the 
validation sets are used for computing the error rates; the minimum error value is obtained for 
m = 2.0. The settings are finally fixed to m = 2.0, $\varepsilon$ = 0.01 and $t_{max}$ = 50 (expanding the limit of 47). 
b) DSA convergence 

The $\varepsilon$ used for accelerating the convergence in the DSA optimization approach is set to 0.01 by 
using the validation set for the four data sets coming from the outdoor natural images mentioned 
above. We have verified that $t_{max}$ = 20 suffices. 

3.2. Training Phase 

We have available a set of 36 digital aerial images acquired during May 2006 from the Abadin 
region, located in Lugo (Spain). They are images in the visible range of the spectrum, i.e., red-green-blue, 
512 x 512 pixels in size. The images were taken on different days from an area with several 
natural spectral signatures. We randomly select 12 images from the set of 36 available. Each image is 
down-sampled by two, eliminating one row and one column out of every two; so, the number of training samples 
provided by each image is the number of pixels of the down-sampled image. The total number of training samples is 
n = 12 x 256 x 256 = 786,432. 
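
The down-sampling step and the resulting sample count can be checked with a short sketch. The function name `downsample_by_two` is hypothetical, and keeping the even-indexed rows and columns is an assumption, since the text does not say which row and column of each pair are discarded.

```python
def downsample_by_two(image):
    """Keep one row and one column out of every two (even indices assumed)."""
    return [row[::2] for row in image[::2]]


# A 512 x 512 image becomes 256 x 256, so 12 training images provide:
training_samples = 12 * 256 * 256
```

Here `image` is a list of rows, each row being a list of pixels (for instance RGB triplets).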

We have considered that the images have four clusters, i.e., c = 4. Table 1 displays the number of 
patterns used for training and the cluster centres estimated by the individual classifiers, which are $v_j$ for 
FC and $m_j$ for BP, equations (1) and (3) respectively. 

Table 1. Number of patterns used for training and class centres obtained for each class 
according to the simple classifiers FC and BP. 

                      cluster w1            cluster w2              cluster w3            cluster w4
Number of patterns    139,790               196,570                 387,359               62,713
BP (m_j)              (37.5, 31.3, 21.5)    (167.0, 142.6, 108.4)   (93.1, 106.0, 66.4)   (226.7, 191.9, 180.4)
FC (v_j)              (35.3, 28.8, 19.9)    (168.0, 142.8, 108.6)   (93.0, 106.4, 66.5)   (229.1, 194.0, 184.4)


3.3. Decision Phase and Comparative Analysis 



The remaining 24 images from the set of 36 are used as images for testing. Four sets, S0, S1, S2 and 
S3, of six images each, are processed during the test according to the strategy described below. The 
images assigned to each set are randomly selected from the 24 images available. 
a) Design of a test strategy 

In order to assess the validity and performance of the proposed approach we have designed a test 
strategy with two purposes: 1) to verify the performance of our approach as compared against some 
existing strategies (simple and combined); 2) to study the behaviour of the method as the training (i.e., 
the learning) increases. 

Our proposed combined DSA (DS) method is compared against the base classifiers used for the 
combination (BP and FC). It is also compared against the following classical combiners, which apply the 
decision as described next [15,20]. Consider the pixel i to be classified. BP and FC 
provide the probability $p_i^j$ and membership degree $\mu_i^j$ respectively, that the pixel i belongs to the class 
$w_j$. After applying a rule, a new support $s_i^j$ is obtained for that pixel of belonging to $w_j$ as follows: a) 
Mean rule (ME) $s_i^j = (\mu_i^j + p_i^j)/2$; b) Maximum rule (MA) $s_i^j = \max\{\mu_i^j, p_i^j\}$; c) Minimum rule (MI) 
$s_i^j = \min\{\mu_i^j, p_i^j\}$ and d) Product rule (PD) $s_i^j = \mu_i^j\, p_i^j$. These rules have been studied in terms of 
reliability [36]. Yager [37] proposed a multi-criteria decision making approach based on fuzzy sets 
aggregation. It follows the general rule and the scheme of the combiners described in [21]. So, DS is 
also compared against the fuzzy aggregation (FA), where the final support that the pixel i belongs to 
the class $w_j$ is given by the following aggregation rule: 

$$s_i^j = 1 - \min\left\{1, \left[\left(1-\mu_i^j\right)^a + \left(1-p_i^j\right)^a\right]^{1/a}\right\}, \quad a \ge 1 \qquad (9)$$

The parameter a has been fixed to 4 by applying a cross-validation procedure like the one described in 
section 3.1a). Given the supports, according to each rule, the decision about the pixel i is made as 
follows: $i \in w_j$ if $s_i^j > s_i^k, \; \forall w_k \mid w_k \neq w_j$. 
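
The classical combiners and the fuzzy aggregation of equation (9) are simple enough to sketch directly. The function names are hypothetical, and both supports are assumed to lie in [0, 1].

```python
def mean_rule(mu, p):
    """ME: average of the FC membership degree and the BP probability."""
    return (mu + p) / 2.0


def max_rule(mu, p):
    """MA: largest of the two supports."""
    return max(mu, p)


def min_rule(mu, p):
    """MI: smallest of the two supports."""
    return min(mu, p)


def product_rule(mu, p):
    """PD: product of the two supports."""
    return mu * p


def yager_aggregation(mu, p, a=4.0):
    """FA, equation (9): 1 - min(1, ((1 - mu)^a + (1 - p)^a)^(1/a)), a >= 1."""
    return 1.0 - min(1.0, ((1.0 - mu) ** a + (1.0 - p) ** a) ** (1.0 / a))


def decide(supports):
    """Index of the class with the highest combined support."""
    return max(range(len(supports)), key=supports.__getitem__)
```

For one pixel, a rule is applied once per class to obtain a list of combined supports, and `decide` then returns the index of the winning class.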

Finally, and most importantly, DS is compared against the optimization strategies based on 
the Fuzzy Cognitive Maps (FM) [27] and the Hopfield Neural Network (HN) [28] paradigms. Both are 
based on the same network topology as the one used in this paper and compute the regularization and 
contextual coefficients similarly to those proposed in this paper through equations (5) and (6), but 
using the membership degrees provided by FC for the network initializations. Nevertheless, for 
comparison purposes, we have changed the roles in the experiments carried out here, so that the nodes 
in both FM and HN are initially loaded with the probabilities, as in the proposed DS approach. 

In order to verify the behaviour of each method as the learning degree increases, we have carried 
out the experiments according to the three STEPs described below. 

STEP 1: given the images in S0 and S1, classify each pixel as belonging to a class, according to the 
number of classes established during the training phase. Compute the percentage of successes 
according to the ground truth defined for each class in each image. The classified pattern samples from 
S1 are added to the previous training samples and a new training process is carried out (Section 2.1) 
with the same number of clusters. The parameters associated with each classifier are updated. The set S0 
is used as a pattern set in order to verify the performance of the training process as the learning 
increases. Note that it is not considered for training. 

STEPs 2 and 3: perform the same process but using the sets S2 and S3 respectively instead of S1; 
S0 is also processed as before. 

As one can see, the number of training samples added at each STEP is 6 x 512 x 512, because this 
is the number of pixels classified during each of the STEPs 1 to 3, belonging to the sets S1, S2 and S3 respectively. 
To verify the performance of each method we have built a ground truth for each image processed, 
under the supervision of expert human criteria. Based on the assumption that the automatic training 
process determines four clusters, we classify each image pixel with the simple classifiers, obtaining a 
labelled image with four expected clusters, and then we select the image with the best results, always 
according to the expert. 

The labels for each cluster, from the selected labelled image, are manually touched up until a 
satisfactory classification is obtained under human supervision. This implies that each pixel is 
assigned a unique label in the ground truth, which serves as the reference for comparing 
the performances. 

Figure 1(a) displays an original image belonging to the set S0; Figure 1(b) displays the 
correspondence between clusters and labels, with the colour given by the values of the corresponding 
cluster centre in the left column and the artificial colour labels in the right column, both in the 
three-dimensional RGB colour space; Figure 1(c) displays the labelled image for the four clusters 
obtained by our proposed DS approach. 

The correspondence between labels and the different spectral signatures is: 1, yellow: forest 
vegetation displaying dark tones; 2, blue: ochre tones without spectral saturation of the sensor; 
3, green: agricultural crop vegetation; 4, red: ochre tones with a clear tendency towards spectral 
saturation of the sensor. Clusters 3 and 4 include buildings, man-made structures and also 
bare soils. 



Figure 1. (a) original image belonging to the set S0; (b) correspondence between classes 
and labels; (c) labelled image with the four classes according to the labels in (b). 

Figure 2 displays the distribution of a representative subset of 4,096 patterns from the image in 
Figure 1(a), obtained by down sampling the image by eight, into the clusters in the three-dimensional 
RGB colour space. The centres of the classes, obtained through the BP classifier during the 
training phase, are also displayed; they are the four cluster centres $m_j$, displayed in the same colour as 
the labels in Figure 1(b). As one can see, there is no clear partition into the four clusters because 
the samples appear scattered throughout the space along the diagonal. Hence, the classification of the 
border patterns becomes a difficult task because they can belong to more than one cluster depending 
on their proximity to the centres. 
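The sample count follows from the subsampling factor: keeping every eighth pixel along each axis of a 512 × 512 image leaves 64 × 64 = 4,096 patterns. A minimal sketch, assuming that "down sampling by eight" means plain decimation and using a random array as a stand-in for the image of Figure 1(a):

```python
import numpy as np

# Stand-in for a 512 x 512 RGB image; the real data are not reproduced here.
image = np.random.default_rng(0).integers(0, 256, size=(512, 512, 3), dtype=np.uint8)

# Keep every eighth pixel on each axis (assumed meaning of "down sampling by eight").
subsampled = image[::8, ::8]

# Flatten to one RGB pattern per row, ready to plot in the RGB colour space.
patterns = subsampled.reshape(-1, 3)
print(patterns.shape)  # (4096, 3)
```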
Figure 2. Distribution of a subset of 4,096 patterns into the four estimated classes around 
the cluster centres of the classes in the RGB colour space. The centres are displayed in the 
same colour as the labels in Figure 1(b). 
b) Results 

Table 2 shows the percentage of error during the decision for the different classifiers. For each 
STEP from 1 to 3, we show the results obtained for both sets of tested images: S0 and, respectively, 
S1, S2 or S3. 

These percentages are computed as follows. Let $I_N^r$ be an image $r$ ($r = 1,\ldots,6$) belonging to the set SN 
($N = 0,1,2,3$); $i$ is the node at the pixel location $(x,y)$ in $I_N^r$. An error counter $E_N^r$ is initially set to zero 
for each image $r$ in the set SN at each STEP and for each classifier. Based on the corresponding 
decision process, each classifier determines the class to which the node $i$ belongs, $i \in w_j$. If the same 
pixel location in the corresponding ground truth image is black, then the pixel is incorrectly classified 
and $E_N^r = E_N^r + 1$. The error rate of the image $I_N^r$ is $e_N^r = E_N^r / Z$, where $Z$ is the image size, i.e., 512 × 512. 

The average error rate for the set SN at each STEP is given by: 

$$\bar{e}_N = \frac{1}{6}\sum_{r=1}^{6} e_N^r \qquad (10)$$

and the standard deviation by: 




In Table 2 they are displayed as percentages, i.e., $\tilde{e}_N = 100\,\bar{e}_N$ and $\tilde{\sigma}_N = 100\,\sigma_N$. The numbers in 
square brackets indicate the rounded and averaged number of iterations required by DS, HN and FM 
for each set (S0, S1, S2 and S3) at each STEP (1, 2 and 3). 
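The error statistics described above can be sketched in a few lines. This is an illustration under assumptions, not the original code: the ground truth is taken here as a single label image per picture (rather than one black and white mask per class), and the normalisation of the standard deviation is exposed as the `ddof` parameter because it is an assumption of this sketch.

```python
import numpy as np


def image_error_rate(predicted, truth):
    """e = E / Z: fraction of pixels whose predicted label differs from the truth."""
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    if predicted.shape != truth.shape:
        raise ValueError("predicted and truth must have the same shape")
    errors = np.count_nonzero(predicted != truth)  # error counter E
    return errors / predicted.size                 # Z is the image size


def set_error_stats(predictions, truths, ddof=1):
    """Mean and standard deviation of the error rates of one set, as percentages.

    ddof=1 (sample standard deviation) is an assumption of this sketch.
    """
    rates = np.array([image_error_rate(p, t) for p, t in zip(predictions, truths)])
    return 100.0 * rates.mean(), 100.0 * rates.std(ddof=ddof)
```

For a set of six images, `set_error_stats` returns the pair of percentages that would fill one cell of the results table.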

Figure 3 displays the ground truth image for the one in Figure 1(a) which has been manually 
rectified from the results obtained through the BP classifier. As in the image of Figure 1(c), each 
colour identifies the corresponding label for the four clusters represented in Figure 1. 
Figure 3. Ground truth image where the labels for the four clusters displayed in Figure 1 
have been manually rectified. 




Table 2. Average percentages of error and standard deviations at each STEP for the four 
sets of tested images S0, S1, S2 and S3. Each cell gives the average percentage of error $\tilde{e}_N$ 
followed, in parentheses, by the standard deviation of error $\tilde{\sigma}_N$; the numbers in square 
brackets are iterations. 

| Group | Method | STEP 1: S0 | STEP 1: S1 | STEP 2: S0 | STEP 2: S2 | STEP 3: S0 | STEP 3: S3 |
|---|---|---|---|---|---|---|---|
| Combination by optimization (DS, HN) and relaxation (FM) | DS (Simulated) | [8] 17.1 (1.1) | [10] 17.8 (1.2) | [8] 14.8 (1.0) | [8] 13.8 (0.8) | [7] 10.5 (0.7) | [7] 13.5 (0.7) |
| | HN (Hopfield) | [9] 20.6 (1.6) | [10] 21.5 (1.5) | [9] 18.2 (1.2) | [8] 17.2 (1.0) | [7] 14.9 (0.8) | [8] 17.2 (0.8) |
| | FM (Fuzzy C.) | [16] 21.6 (1.7) | [18] 21.6 (1.6) | [14] 19.1 (1.2) | [15] 19.8 (1.1) | [11] 16.0 (0.9) | [12] 18.6 (0.8) |
| Fuzzy combination | FA (Yager) | 25.5 (2.2) | 26.8 (2.1) | 24.1 (1.9) | 24.4 (1.8) | 21.5 (1.6) | 20.8 (1.5) |
| Combination rules | MA (Maximum) | 31.2 (2.9) | 30.7 (2.7) | 28.4 (2.8) | 27.5 (2.6) | 26.9 (2.1) | 26.8 (1.9) |
| | MI (Minimum) | 37.1 (3.1) | 36.9 (2.9) | 32.2 (3.3) | 35.2 (2.8) | 30.9 (2.4) | 28.5 (2.3) |
| | ME (Mean) | 29.1 (2.6) | 28.6 (2.2) | 25.3 (2.3) | 26.4 (2.2) | 25.5 (1.9) | 24.3 (1.7) |
| | PR (Product) | 29.5 (2.7) | 29.1 (2.3) | 25.8 (2.4) | 27.0 (2.4) | 25.2 (2.1) | 25.1 (1.8) |
| Simple classifiers | BP (Bayesian Parametric) | 30.2 (2.7) | 29.1 (2.5) | 26.1 (2.2) | 26.4 (2.2) | 25.2 (2.0) | 24.7 (1.8) |
| | FC (Fuzzy clustering) | 32.1 (2.8) | 30.2 (2.6) | 27.1 (2.3) | 27.4 (2.3) | 26.0 (2.1) | 25.9 (2.0) |

c) Discussion 

Based on the error rates displayed in Table 2, we can see that, in general, the proposed DS approach 
outperforms the other methods and achieves the lowest error rates for STEP 3 in both sets S0 and S3. All 
strategies achieve the best performance in STEP 3. Of particular interest is the improvement 
achieved for the set S0 in STEP 3 with respect to the results obtained in STEPs 1 and 2 for that set. Based 
on the above observations, we can conclude that the learning improves the results, i.e., better decisions 
can be made as the learning increases. A detailed analysis by groups of classifiers follows: 

1) Simple classifiers: the best performance is achieved by BP as compared to FC. This suggests that 
the network initialization, through the probabilities supplied by BP, is acceptable. 

2) Combination rules: the mean and product rules achieve similar averaged errors. The 
performance of the mean is slightly better than that of the product. This is because, as reported in [38], 
combining classifiers which are trained in independent feature spaces results in improved performance 
for the product rule, while in completely dependent feature spaces the performance is the same. We 
think that this occurs in our RGB feature space because of the high correlation among the R, G and B 
spectral components [39,40]. High correlation means that if the intensity changes, all three 
components change accordingly. 

3) Fuzzy combination: this approach outperforms the simple classifiers and the combination rules. 
Nevertheless, this improvement requires a suitable adjustment of the parameter a; with other values 
the results get worse. 

4) Optimization and relaxation approaches: once again, the best performance is achieved by DS, 
which, with a number of iterations similar to that of HN, obtains better percentages of successes; the 
improvement is about 3.6 percentage points. DS also outperforms FM. This is because DS 
satisfactorily avoids some local minima of the energy, as expected. 
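The size of the gap between DS and HN can be checked directly from the averaged error percentages in Table 2. The short sketch below only subtracts the tabulated values column by column; the variable names are illustrative.

```python
# Average error percentages copied from Table 2, in column order
# (STEP 1: S0, S1; STEP 2: S0, S2; STEP 3: S0, S3).
ds_errors = [17.1, 17.8, 14.8, 13.8, 10.5, 13.5]
hn_errors = [20.6, 21.5, 18.2, 17.2, 14.9, 17.2]

# Difference HN - DS for each column, rounded to the precision of the table.
gaps = [round(hn - ds, 1) for ds, hn in zip(ds_errors, hn_errors)]
mean_gap = sum(gaps) / len(gaps)

print(gaps)
print(round(mean_gap, 2))
```

The per-column differences lie between 3.4 and 4.4 percentage points, with the largest one for S0 in STEP 3.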

For clarity, Figure 4(a) displays the performance of the proposed DS approach for the set S0 
against HN, because both are optimization approaches based on energy minimization; against ME, which is the 
best of the combination rules; and against BP, the best of the simple classifiers. Figure 4(b) shows 
the energy behaviour for the four sets (S1, S2, S3 and S0 in STEP 3) against the averaged number of 
iterations required to reach convergence. The energy decreases as the optimization process 
advances, as expected according to equation (4). Similar slopes can be observed for the sets S0, S2 
and S3. On the contrary, the slope for S1 is smoother; this explains the greater number of iterations 
required for this set to converge. 

Overall, the results show that the combined approaches perform favourably for the data sets used. 
The MA and ME fusion methods also provide better results than the individual ones. This means that 
combined strategies are suitable for classification tasks. This agrees with the conclusions reported in [13] 
and [15] about the choice of combined classifiers. Moreover, as the learning increases through STEPs 1 
to 3, the performance improves and the number of iterations for S0 decreases, because part of the 
learning has already been achieved at this stage. This means that the learning phase is important and that the 
number of samples affects the performance. 
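For reference, the four fixed combination rules compared in the experiments (MA, MI, ME and PR) can be sketched for a single pixel as follows. The input vectors are made-up supports for four classes, one from BP (probabilities) and one from FC (membership degrees); the function name and the 0-based class indices are illustrative choices, not part of the original work.

```python
import numpy as np


def combine(p_bp, mu_fc, rule):
    """Combine two per-class support vectors and return (winning class, support)."""
    stacked = np.stack([np.asarray(p_bp, dtype=float), np.asarray(mu_fc, dtype=float)])
    rules = {
        "MA": stacked.max(axis=0),   # maximum rule
        "MI": stacked.min(axis=0),   # minimum rule
        "ME": stacked.mean(axis=0),  # mean rule
        "PR": stacked.prod(axis=0),  # product rule
    }
    support = rules[rule]
    return int(np.argmax(support)), support


# Hypothetical supports for one pixel and four classes.
p_bp = [0.6, 0.3, 0.1, 0.0]
mu_fc = [0.1, 0.35, 0.3, 0.25]
for rule in ("MA", "MI", "ME", "PR"):
    print(rule, combine(p_bp, mu_fc, rule)[0])
```

With these made-up values the maximum and mean rules select class 0, whereas the minimum and product rules select class 1, which shows how the choice of rule can change the decision for border patterns.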

The main drawback of DS, as well as of the HN and FM approaches, is the execution time, 
which is greater than that of the methods that do not apply relaxation processes. This is a general problem for 
all kinds of relaxation or optimization approaches. 

All tests have been implemented in MATLAB and executed on an Intel Core 2 Duo, 2.40 GHz PC 
with 2.87 GB RAM operating under Microsoft Windows XP service pack 3. On average, the execution 
time per iteration and per image is 10.1 seconds. 
Figure 4. (a) percentage of error for DS, HN, ME and BP against the three STEPs; 
(b) energy behaviour for S0 to S3 against the number of iterations. 




4. Conclusions 



We have proposed a combined strategy for the decision phase under the DSA framework, which 
performs favourably compared with other existing combined strategies, including those with a 
similar design based on optimization, and also with the individual classifiers. The application of 
the Gestalt principles of similarity, proximity and connectedness allows probabilities and 
membership degrees, supplied by the BP and FC classifiers respectively, to be combined by means of the 
regularization and contextual coefficients. The probabilities supplied by BP are used as initial states in 
a set of neural networks specifically designed for this purpose. These states are iteratively 
updated under the DSA optimization process through the external influences exerted by the nodes in 
the neighbourhood, thanks to the application of the Gestalt principles. 

In future work, the updating through the DSA of both probabilities and membership degrees could 
be considered. With the proposed combined approach, we have established the basis for 
combining more than two classifiers. This can be done by redefining the regularization coefficient. 

Also, if we try to combine classifiers providing outputs in different ranges, it should always be 
possible to map all outputs into the same range. This allows the combination of different kinds of 
classifiers, including self-organizing maps or vector quantization, with BP or FC, for example. 
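One simple way to map outputs into a common range is min-max rescaling to [0, 1]. The sketch below is only an example of such a mapping, assuming that larger outputs mean stronger support for a class; the function name is illustrative and no particular mapping is prescribed here.

```python
import numpy as np


def to_unit_range(scores):
    """Min-max rescaling of one classifier's outputs to the interval [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All outputs are equal: no class is preferred, so return a neutral value.
        return np.full_like(scores, 0.5)
    return (scores - lo) / (hi - lo)


print(to_unit_range([2.0, 4.0, 6.0]))  # [0.  0.5 1. ]
```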

Acknowledgements 

The authors would like to thank SITGA (Servicio Territorial de Galicia), in collaboration with the 
Dimap Company (http://www.dimap.es/), for the original aerial images supplied and used in this paper. 
The authors are also grateful to the referees for their constructive criticism and suggestions on the 
original version of this paper. 



Sensors 2009, 9 






References and Notes 

1. Valdovinos, R.M.; Sanchez, J.S.; Barandela, R. Dynamic and static weighting in classifier fusion. 
In Pattern Recognition and Image Analysis, Lecture Notes in Computer Science; Marques, J.S., 
Perez de la Blanca, N., Pina, P., Eds.; Springer Berlin/Heidelberg: Berlin, Germany, 2005; 
pp. 59-66. 

2. Puig, D.; Garcia, M.A. Automatic texture feature selection for image pixel classification. Patt. 
Recog. 2006, 39, 1996-2009. 

3. Hanmandlu, M.; Madasu, V.K.; Vasikarla, S. A Fuzzy Approach to Texture Segmentation. In 
Proceedings of the IEEE International Conference on Information Technology: Coding and 
Computing (ITCC'04), Las Vegas, NV, USA, April 5-7, 2004; pp. 636-642. 

4. Rud, R.; Shoshany, M.; Alchanatis, V.; Cohen, Y. Application of spectral features' ratios for 
improving classification in partially calibrated hyperspectral imagery: a case study of separating 
Mediterranean vegetation species. J. Real-Time Image Process. 2006, 1, 143-152. 

5. Kumar, S.; Ghosh, J.; Crawford, M.M. Best-bases feature extraction for pairwise classification of 
hyperspectral data. IEEE Trans. Geosci. Remot. Sen. 2001, 39, 1368-1379. 

6. Yu, H.; Li, M.; Zhang, H.J.; Feng, J. Color texture moments for content-based image retrieval. In 
Proceedings of International Conference on Image Processing, Rochester, NY, USA, September 
22-25, 2002; pp. 24-28. 

7. Maillard, P. Comparing texture analysis methods through classification. Photogramm. Eng. 
Remote Sens. 2003, 69, 357-367. 

8. Randen, T.; Husoy, J.H. Filtering for texture classification: a comparative study. IEEE Trans. 
Patt. Anal. Mach. Int. 1999, 21, 291-310. 

9. Wagner, T. Texture Analysis. Signal Processing and Pattern Recognition. In Handbook of 
Computer Vision and Applications; Jähne, B., Haußecker, H., Geißler, P., Eds.; Academic Press: 
St. Louis, MO, USA, 1999. 

10. Smith, G.; Burns, I. Measuring texture classification algorithms. Patt. Recog. Lett. 1997, 18, 
1495-1501. 

11. Drimbarean, A.; Whelan, P.F. Experiments in colour texture analysis. Patt. Recog. Lett. 2003, 22, 
1161-1167. 

12. Kong, Z.; Cai, Z. Advances of Research in Fuzzy Integral for Classifier's Fusion. In Proceedings 
of 8th ACIS International Conference on Software Engineering, Artificial Intelligence, 
Networking and Parallel/Distributed Computing, Tsingtao, China, July 30-August 1, 2007; pp. 
809-814. 

13. Kuncheva, L.I. "Fuzzy" vs "non-fuzzy" in combining classifiers designed by boosting. IEEE 
Trans. Fuzzy Syst. 2003, 11, 729-741. 

14. Kumar, S.; Ghosh, J.; Crawford, M.M. Hierarchical fusion of multiple classifiers for hyperspectral 
data analysis. Patt. Anal. Appl. 2002, 5, 210-220. 

15. Kittler, J.; Hatef, M.; Duin, R.P.W.; Matas, J. On combining classifiers. IEEE Trans. Patt. Anal. 
Mach. Int. 1998, 20, 226-239. 






16. Cao, J.; Shridhar, M.; Ahmadi, M. Fusion of Classifiers with Fuzzy Integrals. In Proceedings of 
3rd Int. Conf. Document Analysis and Recognition (ICDAR '95), Montreal, Canada, August 14-15, 
1995; pp. 108-111. 

17. Partridge, D.; Griffith, N. Multiple classifier systems: software engineered, automatically modular 
leading to a taxonomic overview. Patt. Anal. Appl. 2002, 5, 180-188. 

18. Deng, D.; Zhang, J. Combining Multiple Precision-Boosted Classifiers for Indoor-Outdoor Scene 
Classification. Inform. Technol. Appl. 2005, 1, 720-725. 

19. Alexandre, L.A.; Campilho, A.C.; Kamel, M. On combining classifiers using sum and product 
rules. Patt. Recog. Lett. 2001, 22, 1283-1289. 

20. Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; Wiley: New York, NY, 
USA, 2004. 

21. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification; Wiley: New York, NY, USA, 2001. 

22. Zimmermann, H.J. Fuzzy Set Theory and its Applications; Kluwer Academic Publishers: Norwell, 
MA, USA, 1991. 

23. Geman, S.; Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of 
images. IEEE Trans. Patt. Anal. Mach. Int. 1984, 6, 721-741. 

24. Koffka, K. Principles of Gestalt Psychology; Harcourt, Brace & Company: New York, NY, USA, 
1935. 

25. Palmer, S.E. Vision Science; MIT Press: Cambridge, MA, USA, 2004. 

26. Xu, L.; Amari, S.I. Combining Classifiers and Learning Mixture-of-Experts. In Encyclopedia of 
Artificial Intelligence; Rabunal-Dopico, J.R., Dorado, J., Pazos, A., Eds.; IGI Global (IGI) 
publishing company: Hershey, PA, USA, 2008; pp. 318-326. 

27. Pajares, G.; Guijarro, M.; Herrera, P.J.; Ribeiro, A. IET Comput. Vision; 
doi:10.1049/iet-cvi.2008.0023, 2009, in press. 

28. Pajares, G.; Guijarro, M.; Herrera, P.J.; Ribeiro, A. A Hopfield neural network for combining 
classifiers applied to textured images. Neural Networks; doi:10.1016/j.neunet.2009.07.019, 2009, 
in press. 

29. Haykin, S. Neural Networks: a comprehensive foundation; Macmillan College Publishing Co.: 
New York, NY, USA, 1994. 

30. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Kluwer-Plenum 
Press: New York, NY, USA, 1981. 

31. Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by simulated annealing. Science 1983, 
220, 671-680. 

32. Kirkpatrick, S. Optimization by simulated annealing: quantitative studies. J. Statist. Phys. 1984, 
34, 975-984. 

33. Hajek, B. Cooling schedules for optimal annealing. Math. Oper. Res. 1988, 13, 311-329. 

34. Laarhoven, P.M.J.; Aarts, E.H.L. Simulated Annealing: Theory and Applications; Kluwer 
Academic: Norwell, MA, USA, 1989. 

35. Asuncion, A.; Newman, D.J. UCI Machine Learning Repository. University of California, School 
of Information and Computer Science: Irvine, CA, USA; website http://archive.ics.uci.edu/ml/ 
(accessed September 7, 2009). 






36. Cabrera, J.B.D. On the impact of fusion strategies on classification errors for large ensembles of 
classifiers. Patt. Recog. 2006, 39, 1963-1978. 

37. Yager, R.R. On ordered weighted averaging aggregation operators in multicriteria decision 
making. IEEE Trans. Syst. Man Cybern. 1988, 18, 183-190. 

38. Tax, D.M.J.; Breukelen, M.; Duin, R.P.W.; Kittler, J. Combining multiple classifiers by averaging 
or by multiplying? Patt. Recog. 2000, 33, 1475-1485. 

39. Littmann, E.; Ritter, H. Adaptive color segmentation -A comparison of neural and statistical 
methods. IEEE Trans. Neural Networks 1997, 8, 175-185. 

40. Cheng, H.D.; Jiang, X.H.; Sun, Y.; Wang, J. Color image segmentation: advances and prospects. 
Patt. Recog. 2001, 34, 2259-2281. 



© 2009 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. 
This article is an open-access article distributed under the terms and conditions of the Creative 
Commons Attribution license (http://creativecommons.org/licenses/by/3.0/). 



